Challenge 5

challenge_5
psc
tidyverse
ggplot2
summarytools
Intro to Visualization
Author

Saaradhaa M

Published

August 22, 2022

library(tidyverse)
library(ggplot2)
library(summarytools)

knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)

Read in data

I’ve previously worked with the households dataset, so I’ll be working with the Public School Characteristics dataset for this challenge. Using complete(), I’ll set all missing values to NA.

# read in dataset.
psc <- read_csv("_data/Public_School_Characteristics_2017-18.csv")

# fill in NA.
complete(psc)
# view dataset.
psc

Briefly describe data

The dataset has 100,729 rows and 79 columns. I found a codebook here. This dataset describes the characteristics of public schools from 2017 to 2018 by the National Center for Education Statistics. There is a mix of numeric, character and logical columns.

The data definitely needs to be tidied! Due to time constraints and how massive the dataset is, I’ll first identify the visualisations I want to produce using the codebook. I’ll then work on a subset of the original dataset. I specifically want to look at the following:

  • Univariate: American Indian

  • Univariate: Black

  • Bivariate: state, American Indian

Tidy data

The variables I need are AM, BL and LSTATE.

# creating subset.
subset <- psc %>% select(AM, BL, LSTATE)

# sanity check.
subset

We need to change STATE to a factor.

# change type to factor.
subset$LSTATE <- as.factor(subset$LSTATE)

# view dataset using summarytools.
print(summarytools::dfSummary(subset, varnumbers = FALSE, plain.ascii = FALSE, graph.magnif = 0.50, style = "grid", valid.col = FALSE), 
      method = 'render', table.classes = 'table-condensed')

Data Frame Summary

subset

Dimensions: 100729 x 3
Duplicates: 62955
Variable Stats / Values Freqs (% of Valid) Graph Missing
AM [numeric]
Mean (sd) : 6.7 (30.3)
min ≤ med ≤ max:
0 ≤ 1 ≤ 1395
IQR (CV) : 4 (4.5)
424 distinct values 20609 (20.5%)
BL [numeric]
Mean (sd) : 83 (151.4)
min ≤ med ≤ max:
0 ≤ 19 ≤ 5088
IQR (CV) : 90 (1.8)
1166 distinct values 8325 (8.3%)
LSTATE [factor]
1. AK
2. AL
3. AR
4. AS
5. AZ
6. CA
7. CO
8. CT
9. DC
10. DE
[ 45 others ]
513(0.5%)
1478(1.5%)
1087(1.1%)
28(0.0%)
2465(2.4%)
10325(10.3%)
1900(1.9%)
1031(1.0%)
227(0.2%)
229(0.2%)
81446(80.9%)
0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.2.1)
2022-08-23

Normally, I would pivot ethnicity into 1 column, but we don’t want to do that so that we can run the visualisations. We also have some missing values. This is expected, as not all students may wish to report their ethnicity, or some school districts may have opted not to share this data, etc.

Univariate Visualizations

The univariate visualizations I want to work on are the distributions of students of American Indian and Black ethnicity. These are numeric variables, so we should use histograms.

subset %>% filter(! is.na(AM)) %>% ggplot(aes(AM)) + geom_histogram() + theme_minimal() + labs(title = "American Indian Students in Public Schools (2017-2018)")

subset %>% filter(! is.na(BL)) %>% ggplot(aes(BL)) + geom_histogram() + theme_minimal() + labs(title = "Black Students in Public Schools (2017-2018)")

Bivariate Visualization

I want to plot American Indian (numeric) against state (categorical).

subset %>% filter(! is.na(LSTATE)) %>% ggplot(aes(LSTATE, AM)) + geom_boxplot() + theme_minimal() + labs(title = "American Indian Students in Public Schools By State (2017-2018)", x = "Number of American Indian Students", y = "State") + theme(axis.text.x=element_text(angle=90,hjust=1))

I notice an error in my boxplot - I should have summed up the number of American Indian students by state before doing this! It’s close to 11pm so I’m turning this in first - let me try to fix this within the week.